An Auto-validating Rejection Sampler 



Raazesh Sainudiint and Thomas L. York*t 

+ Department of Mathematics and * Department of Biological Statistics and Computational Bi- 
ology, Cornell University, Ithaca, U.S.A. 

Summary. In Bayesian statistical inference and computationally intensive frequentist infer- 
ence, one is interested in obtaining samples from a high dimensional, and possibly multi-modal 
target density. The challenge is to obtain samples from this target without any knowledge of 
the normalizing constant. Several approaches to this problem rely on Monte Carlo methods. 
One of the simplest such methods is the rejection sampler due to von Neumann. Here we 
introduce an auto-validating version of the rejection sampler via interval analysis. We show 
that our rejection sampler does provide us with independent samples from a large class of 
target densities in a guaranteed manner. We illustrate the efficiency of the sampler by theory 
and by examples in up to 1 dimensions. Our sampler is immune to the 'pathologies' of some 
infamous densities including the witch's hat and can rigorously draw samples from piece-wise 
Euclidean spaces of small phylogenetic trees. 

1. INTRODUCTION 

Obtaining samples from a density p{9) = p*(6)/N p , where 6 G © and © is a compact 
Euclidean subset, i.e., C K n , without any knowledge of the normalizing constant N p = 
J & p*(0) d6, is a basic problem in Bayesian inference and multivariate simulation. Several 
approaches to this problem rely on computationally-intensive Monte Carlo methods through 
conventional floating-point arithmetic. We will concentrate on the rejection sampler due to 
von Neumann [4j|. After a brief introduction to the rejection sampler (RS) in Section[2l an 
interval version of this sampler is formalized in Section 03 This sampler is referred to as the 
Moore rejection sampler (MRS) in honor of Ramon E. Moore who was one of the influential 
founders of interval analysis [34| . A brief introduction to interval analysis, a prerequisite 
to understanding MRS, as well as the notational conventions and background assumed in 
the rest of the paper, are given in Section [7] for readers who are new to interval methods. 
In Section Lemma [TJ shows that MRS produces independent samples from the desired 
target density and Lemma [2] describes the asymptotics of the acceptance probability for a 
refining family of MRSs. Examples demonstrating the robustness and efficiency of MRS 
to complexity and dimensionality of the target are discussed in Section [4] We conclude in 
Section [5l Our sampler is an adaptive and auto- validating von Neumann rejection sampler 
that can draw independent samples from a large class of target densities, including non-log- 
concavc, sharp-peaked, and multi-modal targets. Unlike many conventional samplers, each 
sample produced by MRS is equivalent to a computer-assisted proof that it is drawn from 
the desired target. An open source C++ class library for MRS is publicly available from 
www . stats . ox . ac . uk/~sainudii /codes. 
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Fig. 1. The characteristics of three samplers with target p = p*/N p : (1) Rejection sampler with 
proposal q = q*/N q and the envelope function /„ (2) an independent Metropolis-Hastings sampler 
(IMHS) driven by an independent base chain I with proposal qi = q*/N q * and (3) a local Metropolis- 
Hastings sampler (LMHS) driven by a local base chain L with proposal q L = ql/N q * L centered at the 
current state (open square at the bottom). 



2. Rejection Sampler (RS) 

Rejection sampling [43| is a Monte Carlo method to draw independent samples from a 
target probability distribution p{9) = p*($)/N p , where 9 £ C R". Typically the target p 
is any density that is absolutely continuous with respect to the Lebesgue measure. In most 
cases of interest we can compute the target shape p*(0) for any # £ 0, but the normalizing 
constant N p is unknown. Given a proposal density q = q* /N q and an envelope function f q 
(Figure[T]) over that satisfy certain conditions. RS can produce samples fromp as follows: 



2. 1. Rejection Sampling Algorithm 

(a) Choose a proposal density q{9) = q*(9)/N q from which independent samples can be 
drawn, N q = J & q*{9) d9 is known, and q*{9) is computable for any 9 G 0. 

(b) Find some c for which the inequality 

f q (9)^cq*(9)>p*(9)y9e® (1) 
is satisfied. The smallest such value of c is said to be optimal and denoted by c, i.e., 
c= inf{c : cq*{9) > p*{9),\/9 e 0}. 

(c) Given (I) a target shape p*(9), (II) a proposal density q(9), and (III) an envelope 
function ,f q {9), V 9 £ 0, that satisfy the above conditions, we can draw independent 
samples from the target p{9) as follows: 

(i) GENERATE T ~ q. 

(ii) DRAW H - Uniform[0, f q (T)], where f q (T) > p*{T). 
(hi) IF H < p*(T), THEN set U = T. 



(iv) RETURN to StepB 
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It is not difficult to see that U generated by the above algorithm is distributed according 
to p 2^, 4f|. Observe that the probability Ay that a point proposed according to q gets 
accepted as an independent sample from p through the envelope function f q is the ratio of 



the integrals 



N P 4 J & P*(d)d6 
fq N fq f & f q (0)d0> 



and the probability distribution over the number of samples from q to obtain one sample 
from p is geometrically distributed with mean 1/Ay p3 . [45|. Therefore, for a given p, 
we have to minimize Nf over the allowed possibilities for q and f q in order to obtain an 
efficient sampler with a high acceptance probability Ay . 

3. Moore Rejection Sampler (MRS) 

Moore rejection sampler (MRS) is an auto- validating rejection sampler (RS). It can produce 
independent samples from any target shape p* that has a well-defined natural interval 
extension P* (Definition [5]) over a compact domain 0. MRS is said to be auto- validating 
because it automatically obtains a proposal q that is easy to simulate from, and an envelope 
f q that is guaranteed to satisfy the envelope condition (fT]). MRS guarantees independent 
samples through auto-validating interval methods that also constitute the core of several 
recent computer-assisted proofs of challenging problems [2(| . 



3.1. Theory 

In summary, the defining characteristics and notations of MRS are: 

Compact domain © = [0, 8} 

Target shape p*(6) : -> R 

Target integral N p = f 9 p*(9) d9 

Target density p{6) = ^ : -> R 

Interval extension of p* P* (6) : 10 -> IM 

Proposal shape q*(6) : R 

Proposal integral N q = f & q*(6) dO 

Proposal density q{9) = ^ : -> R 

Envelope function / 9 (#) = cg*(0) 

Envelope integral N fq = f @ f q (6) dd = cN q 

Acceptance probability A^ = 

Partitionof % := { 6®, &W> }. 

If p* £ (£, the class of elementary functions (Definition [7]), and its natural interval 
extension P* is well-defined on 0, then by Theorem 2] 

Rng(p*;®) 4 p *(&) C P*(&) 4 [P* (&) ,P* (&)} 

which implies that 

P*(0) < p*(6) < P*(@), \/6 G © (2) 
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Although [P*(0),P (©)] may over-estimate the range p*(®), we can construct a naive 
MRS to draw samples from p by using the following uniform proposal and constant envelope 
in Algorithm [27X1 

= ^3%) = w )) -1 ' and 
= p*(©), 

where, d(0) = 0]) = — is the diameter of 0. A lower bound for the acceptance 
probability of this naive MRS is given by the range enclosure ratio: 

A ^ = N p = > rf(0) •£*(©) = £*(0) 

P * (0) ^P* (0 ) d(0)-P*(0) " d(0)-P*(0) P*(0)' 

Although this naive MRS can be extremely inefficient (i.e., can have a very low acceptance 
probability) for non-constant target shapes, one has the guarantee due to ([2]) that the 
necessary envelope condition (p} is satisfied. 

A natural way to improve efficiency (i.e., increase the acceptance probability) is via 
partitions. Let 1 = { 9< 2 >, 0(1*1) } be a finite partition of 0. Then by Theoremg] 
we can enclose p*(0W), the range of p* over the i-th element of 1, with the well-defined 
interval extension P* of p* over 

p*(6«) C P*(9 W ) 4 [£*(e (l) ),P*(6 (l) )], Vi G {1,2,...,|X|}. (3) 

For the given partition T we can construct a partition-specific proposal q (9) as a normalized 
simple function over 0, 

m _ 

q *(e) = (N q x) ly E,P*(Q®)l {gee W}, (4) 
where the normalizing constant is obtained from the sum 

m 

^ = E( d ( Q(i) )- p *( e(i) ))- 

i=l 

The next ingredient f q % (9) for our rejection sampler can simply be 

m _ 

/^W = E P *( 0W ) 1 {^eW } (5) 

The necessary envelope condition ([1]) is satisfied by f q ?{&) because of Now, we have 
all the ingredients to perform a more efficient partition-specific Moore rejection sampling. 
Lemma Q] shows that if the target shape p* has a well-defined natural interval extension P*, 
and if U is generated according to the steps in part ED of Algorithm 12.11 and if the proposal 
density q' l (9) and the envelope function f q i{&) are given by (j4j and j5]), respectively, then 
U is distributed according to the target p. Note that the above arguments as well as those 
in the proof of Lemma Q] naturally extend when C W 1 for n > 1. In the multivariate case, 
6« e IK" (Definition H is a box. Thus, we naturally replace the diameter of an interval 
by the volume of a box v(Q^) = nl!=i ^(©1-)- The envelopes and proposals are now simple 
functions over a partition of the domain into boxes. Analogous to the univariate case, the 
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accepted samples are uniformly distributed in the region S C K. n+1 'under' p* and 'over' 
0. Hence their density is p [451 ] . 

linearity of the integral operator and © , 



Next we bound the acceptance probability A p f = AS- for this sampler. Due to the 



= ESi/e<o 

= [ ESi (d(eW) -p*(ew)) , • p*(o w )) ]. 



Therefore, 



a p = = n p > e£i (d(eW) . p*(e(*))) 



Ei2i (d(ew) • p*(ew)) e!2i (<*(ew) • p*(e«) 



We can say something more about the lower bound for A^ by limiting ourselves to target 
shapes within ££, the Lipschitz class of elementary functions (Definition [5]) . If p* E (££ 
then we might expect the enclosure of N p to be proportional to the mesh w of the partition 
% 

w = max d(6 w ). 

ig{i,...,I} 

Lemma [5] shows that if p* E 2;^ and Itiv is a uniform partition of into W intervals, then 
the acceptance probability Ay = 1 — 0(1/W). More generally, any family of MRSs that 
construct their envelopes with P from the invoking family of refining partitions 

{ T Q : a E A } 

can be thought of as a family of rejection samplers whose envelopes descend from above on 
p* in the form of simple functions. The acceptance probability approaches 1 at a rate that is 
no slower than linearly with the mesh. We can gain geometric insight into the sampler from 
an example. The dashed lines of a given shade, depicting a simple function in Figure [TSl is a 
partition-specific envelope function ([5]) for the target shape s*(x) = — Efe=i sin ( fc ( a ~ 3 ) ) 
over the domain = [—10, 6] and its normalization gives the corresponding proposal func- 
tion ([4]). As the refinement of proceeds through uniform bisections, the partition size 
increases as 2 l , i = 1,2,3,4. Each of the corresponding envelope functions in increasing 
shades of gray can be used to draw auto- validated samples from the target s(x) over 0. 
Note how the acceptance probability increases with refinement. 



3.2. Practice 

We theoretically studied the efficiency of uniform partitions for their tractability. In prac- 
tice, we may further increase the acceptance probability for a given partition size by adap- 
tively partitioning 0. In our context, adaptive means the possible exploitation of any 
current information about the target. We can refine the current partition T a and obtain 
a finer partition T a < with an additional box by bisecting a box 0M g r £ a along the side 
with the maximal diameter. There are several ways to choose a 0(*' E 1 a for bisection. 
We explore three ways of choosing 0(*) from the current partition: (a) the box with the 
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largest volume, (b) the box with the largest diameter for its range enclosure and (c) the 
box with the largest diameter for the product of its volume and its range enclosure. When 
@W g IK™ with volume v(O^), the three schemes can be formalized as follows: 

(a) Volume-based Q^*' = argmaxi^O^) 

(6) Range-based W = argmaxd(P*(0 w )) (6) 

eWei„ 

(c) Integral-based 0W = argmax (v(Q {i) ) • d(P*(9®j) 

Given a partitioning scheme, we employ a priority queue to conduct sequential refine- 
ments of 0. This approach avoids the exhaustive argmax computations to obtain the O*-*) 
for bisection at each refinement step. A priority queue (PQ) is a container in which the 
elements may have different user-specified priorities. The priority is based on some sorting 
criterion that is applicable to the elements in the container. The PQ can be thought of as 
a collection in which the "next" element is always the one with the highest priority, i.e., 
the largest with respect to the specified sorting criterion. Since this container sorts using a 
heap which can be thought of as a binary tree, one can add or remove elements in logarith- 
mic time. This is a desirable feature of the PQ. We implement the above three refinement 
schemes through PQs based on their respective sorting criterion. The (a) the volume-based 
PQ manages the family of partitions iiw-, (b) the range-based PQ manages the family $H Q 
and (c) the integral-based PQ manages the family QJ Q . 

Once we have any partition 1 of 0, we can efficiently sample 6 ~ q 1 given by ffl in two 
steps. First we sample a box ©W g % according to the discrete distribution £(©M), 

t(eW) = ' [ " ' , ew g % (7) 
£i=Ue (l) )-^ (0 (l) ) 

and then we choose a 6 £ 8'*' uniformly at random. Sampling from large discrete distribu- 
tions (with million states or more) can be made faster by preproces sing the probabilities and 
saving the result in some convenient lookup table. This basic idea 13011 allows samples to be 
drawn rapidly. We employ a more efficient preprocessing strategy [44j | that allows samples 
to be drawn in constant time even for very large discrete distributions as implemented in 
the GNU Scientific Library (Io| . Thus, by means of priority queues and lookup tables we 
can efficiently manage our adaptive partitioning of the domain for envelope construction, 
and rapidly draw samples from the proposal distribution. We used the Mersenne Twister 
random number generator [32j in this paper. Our sampler class builds on C-XSC 2 . 0, a C++ 
class library for extended scientific computing using interval methods [2l| ■ All computations 
were done on a 2.8 GHz Pentium IV machine with 1GB RAM. Having given theoretical 
and practical considerations to our Moore rejection sampler, we are ready to draw samples 
from various targets. 

4. Discussion with Examples 

We empirically study sampler efficiency by sampling from qualitatively diverse targets since 
analytical results on efficiency are sharp only for relatively simple target parameterizations. 
In Section 14.11 we first study the relative efficiencies of MRSs managed by the three PQs 
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Table 1. Moore rejection sampling from six different Gaussian mixture 
target shapes g n truncated over ©, where n is the number of mixture 
components. 



Target 





Parameters 


9i(x) 


[-10 2 ,10 2 ] 


/ii = —5, ai = 1, and wi = 1.00 


92 (x) 


[-10 2 ,10 2 ] 


/Hi = —5, a i = 1, w\ = 0.25, ^2 = 50, 






o 2 = 0.25 


9s(x) 


[-10 2 ,10 2 ] 


Mi = —15, ^2 = —5, ^3 = 3, /i4 = 6, /us = 50, 






0"1 = (72 = (74 = 1, (73 = 0.5, <7 5 = 0.1, 






Wi = 0.15, w 2 = 0.2, ™ 3 = 0.05, w 4 = 0.1 


9si x ) 


[-10 2 ,10 2 ] 


same as g$(x), except 






(71 = (72 = (7 4 = 0.1, (7 3 = 0.05, (7 5 = 0.01 


gS(x) 


[-10 2 ,10 2 ] 


same as gn(x), except ai = (72 = (74 = 0.01, 






(7 3 = 0.005, (7 5 = 0.001 




[-10 100 ,10 100 ] 


same as gs(x) 



([6]) by sampling from univariate Gaussian mixture targets. Next, we study the effects of 
target complexity (number of components, scales and domain size) on sampler efficiency. In 
Section [321 we study the sampler behavior for a highly multi- modal two-dimensional target 
that is sensitive to a temperature parameter. Using a trivariate mixture target in Section 
14.31 we compare MRS to Monte Carlo Markov chain (MCMC) methods that rely on heuristic 
convergence diagnostics and exploit the connections between RS, importance sampler (IS) 
and independent Metropolis-Hastings sampler (IMHS) to simultaneously produce samples 
from all of them. The effect of dimensionality on sampler efficiency is studied in Section 231 
where we draw samples from multivariate targets, including the multivariate witch's hat. 
Section 14.51 extends the sampler to piece-wise Euclidean domains of tree spaces. Here we 
draw auto- validating samples of small trees (triplets and quartets) from the target likelihood 
function based on primate molecular sequence data. 



4. 1. Univariate Gaussian Mixture 

We apply MRS to targets whose shape g n is obtained from finite mixtures of n univariate 
Gaussian densities truncated over an interval 0. The means (/Vs), standard deviations 
(ci's), weights (wi's), and domains (0's) for each of the six targets studied are shown in 
Table ffl 

First, we study the efficiency of the three partitioning schemes ([6]) by Moore rejection 
sampling from 95. Figure [2] shows the empirical acceptance probability of MRS, calculated 
from up to 10,000 draws from a maximum of 100,000 trials, at each partition size |T Q | 
for each of the three different families of partitions (iW, 9\ a and 5J Q ). Thus, for a given 
partition size |T Q |, the domain interval gets adaptivcly partitioned through \% a \ — 1 
bisections by the appropriate PQ. The family of partitions 2J Q managed by the integral- 
based PQ is the most efficient as it can direct the next refining bisection towards the 
interval with the most uncertainty in its integral estimate. The efficiency of the integral- 
based scheme is even more pronounced for multivariate exponential mixtures (results not 
shown) . 

Note that Lemmas [T] and [2] guarantee that MRS produces independent draws from any 
target in 2;£ . This includes Gaussian mixture targets with any finite number of components 
truncated over any compact interval. Furthermore, the locations inside are arbitrary and 
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Fig. 2. Acceptance probability (A|J versus partition size (|T a |) for six target shapes p* = 
gi,92,g5,g's,95,g5 (TableQ} under different families of partitions: (1) volume-based iW, (2) range- 
based 9\ a and (3) integral-based 5J a (see text for description). 



the scales can be highly spiked (i.e., provided cr^ > and can be enclosed by a machine 
interval via directed rounding \19L |25|L However, the efficiency of the sampler can depend 
on (i) number of components, (ii) spikiness of peaks and (iii) domain size. We empirically 
study these effects by sampling from the six targets (Table [1) using the family of MRSs 
induced by the most efficient partitions QJ a . The acceptance probability plots (Figure [2| 
for targets gi, g 2 , and g$ illustrate the diminishing effect of the number of components on 
efficiency and those for targets g§, g' 5 and g'l , with progressively smaller variances, illustrate 
a similar effect of spikiness on efficiency at every partition size |2J Q |. Note that in both 
cases sampler efficiency quickly recovers for larger partition sizes (> 100). Next we study 
the effect of domain size. In a computer, we cannot represent the real line and are forced 
to approximate it with the entire number screen, a compact interval. Thus, the domain of 
any target is necessarily truncated in a machine. The acceptance probability plot for the 
target shape 175, that is obtained by extending the domain of 55 to a large interval of radius 
10 100 centered at 0, shows the effect of domain size. The first 700 bisections or so are spent 
on zoning in on the intervals with relatively higher probability mass. However, by 1000 
bisections our acceptance probability is almost 1. 



4.2. Bivariate Levy 

The bivariate Levy density l T (X 1 ,X 2 ) over = (0i,0 2 )' = [-100, 100]® 2 © with tem- 
perature parameter T and normalizing constant Ni T has 700 modes. Figure [3] shows Z| , 
i.e., the shape of the Levy density when T = 40 and 10,000 samples drawn from I40 us- 
ing the MRS induced by an integral-based adaptive partitioning of the domain into 150 
rectangles. This MRS produced 10, 000 independent samples in less than 10 CPU seconds 
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at an acceptance probability of about 0.01. Mixtures of bivariate Gaussian shapes yielded 
comparable results. 

l T (X 1 ,X 2 ) =^1*t, where, l* T = exp{-E(X 1 ,X 2 )/T}, (8) 

5 5 

E(X 1 ,X 2 ) = * cos ((i - l)X t + i) J2 3 cos (U + 1)^2 + i) 

4=1 3=1 

+ (Xi + 1.42513) 2 + (X 2 + 0.80032) 2 . 

As the temperature parameter T in It increases, the density approaches a uniform dis- 
tribution on 0. The density is more peaked at low values of T. Various MCMC methods 
that use local proposals tend to mix well at higher temperatures and get trapped at local 
peaks when T is small. To study the effect of temperature on our sampler's efficiency, we 
plot the empirical acceptance probability as well as the CPU seconds taken to draw 10, 000 
samples from each of four Levy targets at different temperatures (T = 1,4,40,400) as a 
function of the partition size |2J Q | (Figure 2]). The efficiency decreases as the temperature 
cools. However, across the range of T we explored, MRS can produce 10, 000 independent 
samples from It in a guaranteed manner within 10 CPU seconds with an acceptance prob- 
ability greater than 1/100. Note that it is difficult to get a Monte Carlo Markov chain to 
mix properly and even more difficult to prove convergence for such targets. 



4.3. Trivariate Needle in the Haystack 

Using the target shape h* © over © = [-10, 10]® 3 , we compare MRS to a popular MCMC 
sampler that relies on heuristics for convergence diagnosis and exploit the connection be- 
tween three Monte Carlo methods. We begin with an introduction to an MCMC sampler 
known as the Metropolis-Hastings sampler (MHS) [33|, [2(| and a commonly used statistic 
for convergence diagnosis. 

h*(x) = ^ exp{-i(0 - Mi)M) 2 } + ^ cxp{-h(x - p 2 )/^ 2 ) 2 } (9) 

(7 1 ^ (J 9 ^ 



4.3.1. Metropolis-Hastings Sampler and Convergence Diagnostics 

Given qy(6, •), a possibly dependent proposal distribution for the base Markov chain Y 
(Figure [T|), the following algorithm produces a Markov chain known as the Metropolis- 
Hastings (MH) chain on merely from the knowledge of ratios of the iorm p* (0) / p* (0') for 
any (9, 9') £ x 0. The stationary distribution of the MH chain is p. 

1 Choose an arbitrary starting point $o and set i = 0. 

2 Generate a candidate point 9' ~ qyidi, •) and u ~ £7(0, 1). 

3 Set: 



e> if u < 



otherwise 



4 Set i = i + 1 and GO TO 2 




Fig. 3. Shape of the Levy density r A0 with its 700 modes (©. 10,000 samples (points on top) from 
ho using the MRS induced by an adaptive partitioning of the domain into 150 rectangles (with gray 
boundaries). 
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1 10 100 1000 10000 100000 le+06 



Partition size (|5J«|) 

Fig. 4. Acceptance probability (A^J ) and CPU seconds versus partition size (|9J a |) for Levy targets 
It, where T is the temperature ©. 



When the base chain has an independent proposal we refer to the MH chain it drives 
according to the above algorithm as the independent Metropolis-Hastings sampler (IMHS) 
and when the base chain has a local proposal we refer to the corresponding MH chain as 
the local Metropolis-Hastings sampler (LMHS). 

Although the MH chain asymptotically approaches p, it is not trivial to know if it 
has converged even for relatively simple cases Q. One often resorts to some heuristic 
convergence diagnostics. A fairly popular diagnostic statistic [l3|, EH runs multiple MH 
chains with randomly dispersed initial conditions and compares the within (W) and between 
(B) chain variation of the sampled draws. When the ratio B/W is small enough, one can 
be fairly certain that all the chains have converged to the same distribution. Note that 
B/W in our definition is a unit translation of the statistic R = 1 + B/W, as defined in [l^ |. 

We run a MH chain with local proposal specified by a uniform cube of side 6eri centered 
at the current state. Using this LMHS we try to draw samples from the following needle in 
the haystack, i.e., h with the following parameters: 

Mi = (0, 0, 0)', M2 = (1, 1, 1)', o-i = l,o-2 = 0.006. (10) 

To diagnose convergence of the LMHS we calculate B/W for each component of x and 
assume that the chain's burn-in time (the time when the samples may be affected by the 
initial condition) has ended when B/W < 0.05 for all three components. The post burn-in 
run length, i.e., the number of samples kept after the burn-in, is set to be 100 times the 
burn-in time (typical run lengths ranged in [10000, 50000] for target h specified by (fTD")) '). 

The above convergence diagnostics are more conservative than the standard recommen- 
dations [H, EH, [24J. Figure [5] shows the results (along the x\ axis) of the above LMHS 
that relies on the B/W statistic from four randomly initialized chains. The running mean 
for each of the four chains has converged to the haystack mean of (0, 0, 0)' and completely 
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Fig. 5. The running mean for four MH chains, as well as B, W, and B/W for xi as a function of run 
length. The true mean for an is at 0.5. 



missed the needle at (1, 1, 1)'. Thus, if we relied on our convergence diagnostic B/W, which 
appears to be consistently vanishing and thus suggestive of convergence to our target h, 
we would have entirely missed the needle. Tuning the diagnostic parameters, including the 
number of chains, burn-in time, and run length, does not help diagnose true convergence 
for much sharper needles (02 < 10 -5 ) that are naturally amenable to our MRS. 

Next we compare the samples obtained from the B/W diagnosed LMHS described above 
with 10, 000 samples from MRS induced by an integral-based adaptive partitioning of 
into 1,000 boxes. We compare the two samplers on two targets: (1) a blunt needle with 
<T2 = 0.10 and (2) a sharp needle with 02 = 0.01. The other parameters of the two targets 
are the same as before (fT0|) . The results are summarized in Figure El The diagnostic B/W 
works better in diagnosing convergence to the blunt target. The bias is severe for the sharp 
needle in all 100 replicates. MRS clearly outperforms LMHS, both in terms of producing 
the true samples and in terms of CPU time (Figure [6]). Moreover, the sharpness of the 
needle only has a minor effect on the efficiency of MRS. For example, for a much sharper 
needle with a 2 = 10 -10 , the MRS induced by an integral-based adaptive partitioning of 
into just 120 cuboids, achieves an acceptance probability of 0.40. 



4.3.2. Rejection, Importance and Independent Metropolis-Hastings 

The same proposal density used in RS may be used as the proposal in importance sampler 
(IS) [23|, 31 1 or as the proposal of the independent base chain in IMHS. The latter two 
samplers are typically more efficient than RS, although in some cases the efficiency of 
IMHS can be as low as half that of RS The disadvantage of IMHS and IS (or RS) 



compared to MRS is in terms of diagnosing convergence and finding the right proposal(s), 
respectively. However, if one shares the proposal obtained through interval methods in MRS 
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LMHS(0.31) 
MRS (0.15) 

LMHS(3.36) 
MRS (0.15) 



0.6 



Fig. 6. Histograms of the mean xi from 100 replicates of the LMHS and MRS. The broken lines 
and solid lines represent targets with a blunt needle (cr 2 = 0.10) and a sharp needle (<j 2 = 0.01), 
respectively. The CPU time in seconds for each sampler is given in parenthesis. 



with IS and IMHS, then we get their Moore versions which circumvent the disadvantages 
that arise from non-rigorously constructed proposals. Indeed all three samples may be 
generated simultaneously from the same sequence of proposed values [ij; each proposed 
value would be output with its importance weight, with some subset of the proposed values 
marked as IMHS-accepted, and with some further subset of those additionally marked as 
MRS-acccpted and thereby constituting our collection of independent samples. 

Figure [7] shows the mean squared error MSE for the sampler trio as a function of the 
size of the partition that is invoking their common proposal. The sample trio is drawn from 
our target h ([9]) with the sharp needle {a 2 = 0.01). To obtain the MSE for each sampler 
with target p and proposal q, we drew xi ~ q, i = 1,...,N using MRS, where N is the 
number of samples needed to obtain 100 Moore rejection samples. For IS each of the Xi's 
were assigned the importance sampling weight Wi = p(xi) / q(xi) and the estimated mean 
fi = Y2i=i ( w i x i) fl2i=i w i- The MRS estimated mean is /i = X)i=i ^Vj/lOO, where x Ti is 
the i th MRS sample. For IMHS the mean is estimated by ft = Y^iL ri X i/(N — »'i + 1), 
where n is the index of the first MRS sample; the early samples Xi, i < 7*1 arc excluded as 
burn-in. This mean estimation was repeated 500 times to obtain j = 1, . . . , 500 for each 
sampler. Finally, the MSE was computed with the known mean /z = (0.5, 0.5, 0.5) under 
the Euclidean norm \ — fi\\ as . \ \j2j — /x|| 2 /500. 

The Figure [7] compares the three samplers and shows a typical pattern: at low ac- 
ceptance probability, IS has lowest MSE, and MRS the highest, while at high acceptance 
probability all three samplers approach the same MSE. The lower MSE of IS is due to the 
large number of (MRS-discarded) samples being appropriately weighted. Observe that such 
an auto-validating Moore importance sampler can be efficient and rigorous in estimating 
some expectation E p f(x) of interest. As the acceptance probability of MRS increases with 
refinement of the domain and the number of samples from each sampler approaches equal- 
ity, the MSE of all three samplers converge as expected. For some target shapes, e.g. the 
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Partition size (|5J Q |) 

Fig. 7. MSE of the three samplers, namely, MRS, IMHS and IS, as well as the acceptance probability 
of MRS and IMHS as a function of partition size (see text for description). 

witches hat (p~2|) . we have observed the MSE of IMHS to be greater than that of MRS, but 
by less than a factor of 2, in agreement with [28[ (results not shown). 

4.4. Multivariate Rosenbrock and Witch's Hat 

Next we examine the effect of dimensionality on efficiency of MRS through the challenging 
Rosenbrock function from the optimization literature. We make it a Rosenbrock density 
(?*£>) in D dimensions over some compact ® G VL D by appropriately normalizing the Rosen- 
brock shape r* D (fTTj) . 

D 

r* D (X) = exp{- 5>00pQ - X}_ x f + (1 - AVi) 2 )} (11) 

i=1 

Figure [S] summarizes the efficiency for various Rosenbrock densities. For the more de- 
manding nine dimensional Rosenbrock target rg, we were able to draw 10,000 samples in 
about 650 CPU seconds at an acceptance probability of 1/10,000. The acceptance proba- 
bility can be improved and/or D can be increased naively if we allowed the partition size 
to be greater than a million. Thus, the extent of RAM (random access memory) at our 
disposal ultimately determines the complexity and dimensionality of the target that can 
be rigorously sampled with MRS. However, the manner in which the natural interval ex- 
tension is constructed will greatly affect the sampler's efficiency as discussed later. The 
acceptance probability for the relatively less complicated multivariate exponential mixture 
density truncated over © = [—100, 100]® 10 is higher at 1/1000 compared to that for the 
Rosenbrock target rg even when there were 10 modes inside a 10-dimensional © (results 
not shown). Thus, the complexity of the target greatly affects efficiency. 
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Fig. 8. Acceptance probability (A^° ) and CPU time to generate 10 4 samples, as a function of 
partition size (|9J a |), for Rosenbrock targets m over © = [-10, 10]® D , where D is the dimension. 



Finally, we arrive at the infamous witch's hat density which is considered to be a patho- 



logical target for most samplers [241 ] . The density is often thought of in two dimensions as an 
m : (1 — to) mixture of a cone with center C and basal radius R and a uniform distribution 
on a rectangle. It can be easily generalized to D dimensions as follows: 



w?(X)=ml m _ cll < R} (i-il^U) 



H + (1 — to) — , where 



2 — 1 



Our formulation of the witch's hat is even more challenging than the differcntiable formu- 
lation suggested in 24[, as the gradient is over the entire brim. MRS is amenable to any 
target with a well-defined interval extension over the domain including . Mixtures of sev- 
eral sharply-peaked bivariate normals with a uniform distribution, a further generalization 
of the other formulation [24j|, pose no sampling problems to MRS. Figure [S] shows that one 
can efficiently sample from witch's hat targets by rigorously constructing envelopes through 
the natural interval extension of (jT2J) . We can even sample from the hat of an eleven di- 
mensional witch {wq°). We can also make the brim of the hat as large as [— 10 100 , -|_io 100 ]® 2 
without much trouble (u)q). Note that decreasing the radius has a similar effect as widening 
the brim, in terms of lowering the acceptance probability as a function of partition size. 
Note that we are able to sample rigorously from a range of multivariate witch's hat targets 
with reasonable partition sizes and CPU seconds. 
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Partition size (|9J Q |) 



Fig. 9. Acceptance probability and CPU time to generate 10 4 samples, versus partition size for 
witch's hat targets w®, where D is the dimension of the domain and R = lCT r is the hat's radius 
(fT2|) . The hats of all targets were centered at the two vector (2, . . . , 2). The domain for hi was 

[-10 100 , +10 100 ], but all other targets had © = [-10, 10]® D . 



4.5. Likelihood of Jukes-Cantor Triplets and Quartets 

Inferring the ancestral relationship among a set of species based on their DNA sequences 
is a basic problem in phylogenetics 4^, {jj. One can obtain the likelihood of a particular 



phylogenetic tree that relates the species of interest by superimposing a simple Markov 
model of DNA substitution due to Jukes and Cantor on that tree. The length of 
an edge (branch length) connecting two nodes (species) in the tree represents the amount 
of evolutionary time (divergence) between the two species. The likelihood function over 
trees obtained through a post-order traversal (e.g. [U) has a natural interval extension over 
boxes of trees [39j . This allows us to draw samples from the posterior distribution over some 
compact box in the tree space using our MRS. Using the data from the mitochondria of 
Chimpanzee, Gorilla, and Orangutan Q that can be summarized by 29 distinct site patterns 
[38j , we obtain the posterior distribution by normalizing the likelihood with a uniform prior 
over the biologically meaningful compact domain = [10~ 10 , 10]® 3 . 10,000 independent 
samples were drawn in 942 CPU seconds from the posterior distribution over Jukes-Cantor 
triplets, i.e. unrooted trees with three edges corresponding to the three primates emanating 
from their common ancestor. Figure [TOl shows these samples (gray points) scattered about 
the verified global MLE of the triplet [39j . 

We were able to draw samples from Jukes-Cantor quartets (unrooted trees for four taxa) 
by adding the homologous sequence of the Gibbon which resulted in 61 distinct site patterns 
[38[ . This is a more challenging problem because there are 3 distinct tree topologies for an 
unrooted quartet tree and each of these has five edges. Thus, the domain of quartets is a 
piecewise Euclidean space that arises from a fusion of 3 distinct five dimensional orthants. 
Since the post-order traversals specifying the likelihood function are topology-specific, we 
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extended the likelihood over a compact box of quartets in a topology-specific manner. The 
computational time was about a day and a half to draw 10, 000 samples from the quartet 
target due to low acceptance probability of the naive likelihood function based on distinct 
site patterns. All the samples had the same topology which grouped Chimp and Gorilla 
together, i.e. ((Chimp, Gorilla), (Orangutan, Gibbon)). The samples were again scattered 
about the verified global MLE of the quartet [III ■ The marginal triplet trees (dark dots) 
within the 10, 000 sampled quartets are also plotted in Figure [TO] Observe the influence 
of an additional taxon on the triplet estimates. This quartet likelihood function has an 
elaborate DAG (Definition [8]) with numerous operations. When the data got compressed 
into sufficient statistic through algebraic statistical methods (36|, the efficiency increased 
tremendously (for e.g. triplet efficiency increases by a factor of 3.7). This is due to the 
number of leaf nodes in the target DAG, which encode the distinct site patterns of the 
observed data into the likelihood function, getting reduced from 29 to 5 for the triplet target 
and from 61 to 15 for the quartet target [5j]. Poor sampler efficiency makes it impractical 
to sample from trees with five or more leaves. However, one could use such triplets and 
quartets drawn from the posterior distribution to stochastically amalgamate and produce 
estimates of larger trees via fast amalgamating algorithms [4l|, [27 1. A collection of large 
trees obtained through such stochastic amalgamations would account for the effect of finite 
sample sizes (sequence length) as well as the sensitivity of the amalgamating algorithm 
itself to variation in the input vector of small tree estimates. It would be interesting to 
investigate if such stochastic amalgamations can help improve mixing of MCMC algorithms 
on large tree spaces [35 1 . 



5. Conclusion 

Interval methods provide a rigorous, efficient and fairly general way of constructing enve- 
lope functions for use in rejection sampling from target densities with a well-defined interval 
extension. In particular the method allows the envelope to be drawn from a large, flexible 
family of functions (simple functions over a family of adaptively refined partitions) , and to 
be constructed in a manner that rigorously maintains the envelope property as the envelope 
function is adaptively refined. Refining the partition decreases the rejection probability at 
a rate that is no slower than linear with the mesh. The corresponding proposal density 
is easily constructed in ©(partition size) time into a data structure that allows samples 
from it to be drawn in constant time. When one substitutes conventional floating-point 
arithmetic for real arithmetic in a computer and uses discrete lattices to construct the en- 
velope and/or proposal, it is generally not possible to guarantee the envelope property and 
thereby ensure that samples are drawn from the desired target density, except in special 



cases. For example, the adaptive rejection sampler (ARS) [16|, [17[ is efficient at drawing 
independent samples only from one dimensional log-concave targets. In ARS as well as a 
subsequent generalization of it through a Metropolis sampling step [3] to one dimensional 
non-log-concave targets, one can draw samples from higher dimensional targets by Gibbs 
sampling 3, 11 1 one dimension at a time. On one hand, Gibbs sampling, being a special 



case of Metropolis-Hastings sampling [6j , is at the mercy of heuristic convergence diagnos- 
tics. On the other, proposals constructed for non-log-concave conditional densities from 
finitely many points cannot guarantee that the density has not soared between the sampled 
points. However, the construction of the Moore rejection sampler through interval meth- 
ods, that enclose the target shape over the entire real continuum in any box of the domain 
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Fig. 10. 10,000 Moore rejection samples (gray dots) from the posterior distribution over the three 
branch lengths of the unrooted phylogenetictree space of Chimpanzee, Gorilla and Orangutan based 
on their mitochondrial DNA. Marginal triplets (dark dots) of 10,000 samples from the quartet tree 
space of Chimp, Gorilla, Orangutan and Gibbon. 
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with machine-representable bounds, in a manner that rigorously accounts for all sources of 



numerical errors (see [25J, [19| for a discussion on error control), naturally guarantees that 
the Moore rejection samples arc independent draws from the desired target. Moreover, 
the target is allowed to be multivariate and/or non-log-concave with possibly 'pathological' 
behavior, as long as it has a well-defined interval extension. 

Unfortunately, the efficiency of MRS is not immune to the curse of dimensionality and 
target DAG complexity. When the DAG for the likelihood gets large, its natural inter- 
val extension can have terrible over-enclosures of the true range, which in turn forces the 
adaptive refinement of the domain to be extremely fine for efficient envelope construction. 
Thus, a naive application of interval methods to targets with large DAGs can be terribly 
inefficient. In such cases, sampler efficiency rather than rigor is the issue. Thus, one will not 
obtain samples in a reasonable time rather than produce samples from some unknown and 
undesired target. There are several ways in which efficiency can be improved for such cases. 
First, the particular structure of the target DAG should be exploited to avoid any redun- 
dant computations. For example, algebraic statistical methods can be used to find sufficient 
statistics to dissolve symmetries in the DAG as done in Section [4.5l Second, we can further 
improve efficiency by limiting ourselves to diffcrentiablc targets in C n . Tighter enclosures 
of the range p*(Q^') with P*(0W) can come from the enclosures of Taylor expansions of p* 
around the midpoint m(0W) through interval-extended automatic differentiation [37l.lll.l25j 
that can then yield tighter estimates of the integral enclosures |42j. Third, we can employ 
pre-processing to improve efficiency. For example, we can pre-enclose the range of a possibly 
rescaled p* over a partition of the domain and then obtain the enclosure of P* over some 
arbitrary through a combination of hash access and hull operations on the pre-enclosures. 
Such a pre-enclosing technique reduces not only the overestimation of target shapes with 
large DAGs but also the computational cost incurred while performing interval operations 
with processors that are optimized for floating-point arithmetic. Fourth, efficiency at the 
possible cost of rigor can also be gained (up to 30% ) by foregoing directed rounding during 
envelope construction. 

In this paper we focused on the interval extension of the simplest sampler, namely the 
rejection sampler. We also exploited the direct connections between rejection sampler, 
importance sampler and independent Metropolis-Hastings sampler to produce sample trios 
from the interval extensions of all three samplers. It would be interesting to compare other 
Monte Carlo methods to their natural interval extensions. For example, even Metropolis- 
coupled MCMC [I3| which was designed to accelerate convergence for complicated targets 
is known to converge exponentially slowly in some cases 0] • Preliminary analysis suggests 
that a non-rigorous interval extension of the local Metropolis-Hastings sampler (relying on 
heuristic convergence diagnostics) may have a higher probability of converging to the target 
when compared to its floating-point cousin. Such hybrid samplers that rely on both interval 
and local methods may efficiently produce fairly reliable samples from challenging higher 
dimensional targets. 
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7. Appendix A 

Arithmetic on intervals in IK = {[a;, y] : x < y, x, y G K} 

Definition 1 (Interval arithmetic). If the binary operator* is one of the elemen- 
tary arithmetic operations {+, — , •, /}, then we define an arithmetic on operands in IK by 

X*Y = {x*y : x E X,y eY} 

with the exception that X/Y is undefined if G Y. 

Theorem 1. Arithmetic on the pair X, Y G IK is given by: 

X + Y = [x + y,x + y] 

X-Y = \x-y,x-y\ 

X-Y = [min{xy, xy, xy, xy} , max{xj/ , xy, xy, xy}} , 

X/Y = X ■ [1/y, 1/y], provided, 0<£Y. 

Proof (cf. 0,111): Since any real arithmetic operation x -k y, where * G {+,-,-,/} and 
x, y G K, is a continuous function x k y = k(x, y) : K x K — > K, except when y = under 
/ operation. Since X and Y" arc simply connected compact intervals, so is their product 
X xY. On such a domain X x Y, the continuity of y) (except when * = / and G Y) 
ensures the attainment of a minimum, a maximum and all intermediate values. Therefore, 
with the exception of the case when ★ = / and G Y, the range X * Y has an interval form 
[min (x * y), max (x * y)], where the min and max are taken over all pairs (x,y) G X x Y. 
Fortunately, we do not have to evaluate x*y over every (x, y) G X x Y to find the global min 
and global max of *(x, y) over X xY , because the monotonicity of the *(a;, y*) in terms of 
x G X for any fixed y* G Y implies that the extremal values are attained on the boundary 
of X x Y, i.e., the set {x, y,x, and y}. Thus the theorem can be verified by examining the 
finitely many boundary cases. □ 

An extremely useful property of interval arithmetic that is a direct consequence of Def- 
inition [T] is summarized by the following theorem. 

Theorem 2 (Fundamental property of interval arithmetic). If X c X' and 
Y C Y' and * G {+, -, •, /}, then 

X*Y C X'-kY', 
where we require that ^ Y' when -k = j . 
Proof: 

X-kY = {x *y : x G X, y G Y} C {x *y : x G X', y e Y'} = X' * Y'.D 

Note that an immediate implication of Theorem [2] is that when X = x and Y = y are thin 
intervals (real numbers x and y), then X'-kY' will contain the result of the real arithmetic 
operation x * y. 

Definition 2 (Range) . Consider a real-valued function f : D — > K where the domain 
D C K™. TTie range of f over any E C Z) is represented by Rng(f; E) and defined to be the 
set 

Rng(f- 1 E)^{f(x):xeE} 

However, when the range of f over any X G IK™ such that X C D is of interest, we will 
use the short-hand f{X) for Rng(f:X). 
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Definition 3 (Interval extension of subsets of E"). For any Euclidean subset 
C R™ let us denote its interval extension by 10 and define it to be the set 



We refer the the kth interval of interval vector or box X £ BR™ by X^. 

Definition 4 (Inclusion ISOTONy). An box-valued map F : D — > M m , where D £ 
IR™, is inclusion isotonic if it satisfies the property 



Definition 5 (The natural interval extension). Consider a real-valued function 
f : D —> R given by a formula, where the domain D £ IR™ . // real constants, variables, and 
operations in f are replaced by their interval counterparts, then one obtains 



F is known as the natural interval extension of f . This extension is well-defined if we do 
not run into division by zero. 

Theorem 3 (Inclusion isotony of rational functions). Consider the rational func- 
tion f(x) = p(x)/q(x), where p and q are polynomials. Let F be its natural interval extension 
such that F(Y) is well-defined for some Y £ IR and let X, X' £ IR. Then we have 

(i) Inclusion isotony: VXCI'CF => F(X) C F(X') , and 

(ii) Range enclosure: =^ Rng(f; X) = f(X) C F{X). 

Proof (cf. [!2|): Since F(Y) is well-defined, we will not run into division by zero, and there- 
fore (i) follows from the repeated invocation of Theorem [5J We can prove (ii) by contradic- 
tion. Suppose Rng(f;X) ^ F(X). Then there exists x £ X, such that f(x) £ Rng(f;X) 
but f(x) £ F(X). This in turn implies that f(x) = F([x,x]) ^ F(X), which contradicts 
(i). Therefore, our supposition cannot be true and we have proved (ii) Rng(f; X) C F(X). 



Definition 6 (Standard functions). Piece-wise monotone functions, including ex- 
ponential, logarithm, rational power, absolute value, and trigonometric functions, constitute 
the set of standard functions 

& = { a x , log b (x), x v l q , \x\, sin(a;), cos(a;), tan(x), sinh(x), . . . , arcsin(a;), . . . }. 

Such functions have well-defined interval extensions that satisfy inclusion isotony and exact 
range enclosure, i.e., Rng(f; X) = f(X) = F(X). Consider the following definitions for the 
interval extensions for some monotone functions in & with X £ IR, 



10 = {X £ IR" : x,x £ 0} 



VICYCD 



F(X) C F(Y). 



F(X) : ID -> IR. 



□ 



exp(X) 
arctan(X) 



[exp(a;) ; exp(x)] 
[arctan(x), arctan(af)] 





if <x 
if < x 



log(X) 



[log(x),log(x)] 
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and a piece-wise monotone function in 6 with Z + and Z representing the set of positive 
and negative integers, respectively. 

if n G Z + is odd, 
if n G Z + is even, 
if n = 0, 
if n G Z";0 £ X 

Definition 7 (Elementary functions). A real-valued function that can be expressed 
as a finite combination of constants, variables, arithmetic operations, standard functions and 
compositions is called an elementary function. The set of all such elementary functions is 
referred to as <£. 

Definition 8 (Directed acyclic graph (DAG) of a function). One can think 
of the process by which an elementary function f is computed as the result of a sequence of 
recursive operations with the subexpressions fa of f where, i = 1, . . . ,n < oo. This involves 
the evaluation of the subexpression /, at node i with operands Sj ( , Si 2 from the sub-terminal 
nodes of i given by the directed acyclic graph (DAG) for f 

{fi( s i! i s i 2 ) •' if node i has 2 sub-terminal nodes Si 1 , Si 2 
fi[sii) ■' if node i has 1 sub-terminal node (13) 
I{si) : if node i is a leaf or terminal node, I(x) = x. 

The leaf or terminal node of the DAG is a constant or a variable and thus the fi for a leaf 
i is set equal to the respective constant or variable. The recursion starts at the leaves and 
terminates at the root of the DAG. The DAG for an elementary f with n sub- expressions 
fi,h, ■■■,fn is : 

{®h}U -» ©/» = /(*), (14) 

where each Qfi is computed according to l|13|) . 

For example the elementary function x-sin((a; — 3)/3) can be obtained from the terminus 
0/6 of the recursion {©/i}f = i on the DAG for / as shown in Figure [TT] It would be 
convenient if guaranteed enclosures of the range f(X) of an elementary / can be obtained 
by its natural interval extension F(X). We show that inclusion isotony does indeed hold 
for F, i.e. if X C Y, then F(X) C F(Y), and in particular, the inclusion property that 
x e X f(x) e F(X) does hold. 

Theorem 4 (The fundamental theorem of interval analysis). Consider any el- 
ementary function / € <£. Let F : Y — > BR be its natural interval extension such that F(Y) 
is well-defined for some Y £ IR and let X, X' G IM. Then we have 

(i) Inclusion isotony: VICI'CF =^ F(X) C F(X') , and 

(ii) Range enclosure: =^ Rng(f] X) = f(X) C F{X). 

Proof (cf. [i^l): Any elementary function / G (£ is defined by the recursion [14] on its sub- 
expressions fi where i G {1, . . . , n} according to its DAG. If f(x) = p(x)/q{x) is a rational 
function, then the theorem already holds by Theorem [3l and if / G 6 then the theorem 



X n = 



l(X} n ,\X\ n ] 

[1,1] 

ll/x,l/x\- n 
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Fig. 11. Recursive evaluation of the sub-expressions fx, . . . ,/ 6 on the DAG of the elementary func- 
tion f(x) = 0/ 6 = x ■ sm((x - 3)/3) 

holds because the range enclosure is exact for standard functions. Thus it suffices to show 
that if the theorem holds for /i,/2 G £, then the theorem also holds for fx * / 2 , where 
* S {+,—,/,-, o}. By o we mean the composition operator. Since the proof is analogous for 
all five operators, we only focus on the o operator. Since F is well-defined on its domain Y , 
neither the real- valued / nor any of its sub-expressions fi have singularities in its respective 
domain Yi induced by Y. In particular f 2 is continuous on any X 2 and X 2 such that 
X% Q X 2 Q Y 2 implying the compactness of F 2 (X 2 ) = W 2 and F 2 (X 2 ) = W 2 , respectively. 
By our assumption that Fx and F 2 are inclusion isotonic we have that W 2 C W 2 and also 
that 

Fi o F 2 (X 2 ) = Fx(F 2 {X 2 )) = Fx(W 2 ) C Fx[Wi) = Fi(F 2 (X 2 )) = Fx o F 2 {X 2 ) 

The range enclosure is a consequence of inclusion isotony by an argument identical to that 
given in the proof for Theorem [3] □ 

The fundamental implication of the above theorem is that it allows us to enclose the 
range of any elementary function and thereby produces an upper bound for the global 
maximum and a lower bound for the global minimum over any compact subset of the 
domain upon which the function is well-defined. We will see in the sequel that this is the 
work-horse of randomized enclosure algorithms that efficiently produce samples even from 
highly multi-modal target distributions. 

Unlike the natural interval extension of an / S 6 that produces exact range enclosures, 
the natural interval extension F(X) of an / e (E often overestimates the range f(X), but can 
be shown under mild conditions to linearly approach the range as the maximal diameter of 
the box goes to zero, i.e., i)(F(X), f(X)) < a-doo^X) for some a > 0. This implies that a 
partition of X into smaller boxes {X^\ ■ • • , gives better enclosures of f(X) through 

the union 

\j?=x F ( x{i) ) as illustrated in Fi gure I12I Next we make the above statements 

precise. 

Definition 9. A function f : D — > R is Lipschitz if there exists a Lipschitz constant 
K such that, for all x,y G D, we have \f(x) — f(y)\ < K\x — y\. We define (£o to be 



24 



Raazesh Sainudiirt and Thomas L. Yor\C 



150 




-150 - 



Fig. 12. Range enclosure of the interval extension of — J2t=i kx sin 3) ) linearly tightens with 
the mesh. 



the set of elementary functions whose sub-expressions fi, i = 1, . . . ,n at the nodes of the 
corresponding DAGs are all Lipschitz. 

Theorem 5 (Range enclosure tightens linearly with mesh). Consider a func- 
tion f : D —> M with f G (£,£. Let F be an inclusion isotonic interval extension of f such 
that F(X) is well-defined for some X 6 MR, X C /. Then there exists a positive real number 
K, depending on F and X, such that if X = U^X^ , then 



Rng(f;X) C |J F(X«) C F(X) 



r njF(X^)j <r{Rng(f;X)) + K max^riX^) 

Proof : The proof is given by an induction on the DAG for / similar to the proof of 
Theorem H (See (H). 



8. Appendix B 

Here we will study the Moore rejection sampler (MRS) carefully. Lemma [T] shows that MRS 
indeed produces independent samples from the desired target and Lemma [2] describes the 
asymptotics of the acceptance probability as the partition of the domain is refined. 
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Lemma 1. Suppose that the target shape p* has a well-defined natural interval extension 
P* . If U is generated according to the steps in part [c| of the rejection sampling algorithm, 
and if the proposal density q^(9) and the envelope function / 9 x(0) are given by ([4]) and §5§, 
respectively, then U is distributed according to the target p. 

Proof: From and observe that f q i(t) = q % (t)N q i. Let us define the following two 
subsets of M 2 , 

B q = {(t,h) :0<h< /,*(*)}, and B p = {(t,h) : < h<p*(t)}. 

First let us agree that steps [cTI and [cITl of part [c] of the rejection sampling algorithm produce 
a pair (T, H) that is uniformly distributed on B q . We can see this by letting k(t, h) denote 
the joint density of (T,H) and k(h\t) denote the conditional density of H given T = t. 
Then, 

' q % {t)k{h\t) if (t,h)eB q 
otherwise . 



k(t, h) 

Since we sample a uniform height h for a given t in Step [en] of the algorithm 
k{h\t) 



(UHt))- 1 = (q % (t)N q ^)- 1 if he [0,/,*(t)] 
otherwise. 



Therefore, 

k(t,h) 



q % {t)k(h\t) = qS(t)/(qZ(t)N q% ) = (A^i) -1 if (t,h) £ B q 
otherwise . 



Thus we have shown that the joint density of (T, H) is a uniformly distribution on B q . The 
above relationship also makes geometric sense since the volume of B q is exactly N q ? . Now, 
let (T*,H*) be an accepted point, i.e., (T*,H*) £ B p C B q . Then, the uniform distribution 
of (T, H) on B q implies the uniform distribution of (T*, H*) on B p . Since the volume of B p 
is N p , the p.d.f. of (T*, H*) is identically 1/N p on £> p and elsewhere. Hence, the marginal 
p.d.f. of U = T* is 

w (u) = ff (u) l/N p dh 
= 1/N p jf {u) ldh 

= l/N p J NMu) ldh, v p(u)=p*(u)/N p 
= p(u). □ 

Lemma 2. Let Uw be the uniform partition of = [0,0] info W intervals each of 
diameter w 

w = (t£ 

®w = [i+ii-^w, 6 + iw],i = l,...,W 

Uw = {e$,i = i,...,w}. 

and let p* £ ££ 7 then 

K w = 1 - G(l/W) 



26 Raazesh Sainudiirt and Thomas L. Yor\C 
Proof 

Then by means of Theorem [5] 

d(e#) = o(i/wo fj( P *(eW),p(eW)) = o(i/wo 

d(p*(e$)) = o(i/wo, •.• P * e <s £ 

Therefore 

|iiw| w 

i—l i=l 

and we have 

d(^E]=i^(©W)) = 0(i/^) =► aj w = i - o(i/w) 

Therefore the lower bound for the acceptance probability of MRS approaches 1 no 

slower than linearly with the refinement of by ttyy. Note that this should hold for a 
general nonuniform partition with w replaced by the mcsh.D 
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